Assignment 2

SHS253@pitt.edu

Which features are important in determining the status of a visa?

In [2]:
# data processing
import pandas as pd
import numpy as np
import collections
import random
import requests
from io import BytesIO
# visualization (missingno for missing-data matrices, wordcloud for text)
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno
import imageio
from wordcloud import WordCloud
# modeling and metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, accuracy_score
# silence pandas/seaborn deprecation noise
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
In [3]:
from IPython.display import Image
Image(filename = "results.png")
Out[3]:

Legend

For immigrants in the USA, the biggest concern is the status of our visa, and many factors influence the decision. The dataset has around 154 features, but most of them either carry little insight or have substantial missing data, so I focused on 18 of them. As a first pass at the analysis, I set out to build an optimized model and visualize it so that tech workers (more than 60% of applicants are software engineers) can plan their visa process. I visualized the features to look for meaningful patterns, but the most useful results came from an XGBoost model (a gradient-boosted tree classifier) used to rank feature relevance, as shown in the first panel of the figure above. The other two panels show the top two features: if the decision falls between March and September, the chance of acceptance is above 90%, versus under 20% in the remaining months; similarly, if the case was received between November and February, the chance of acceptance is above 80%, higher than in other months. Interpreting these features individually, with the model's help, can improve the odds of getting the visa approved.
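The month-by-month acceptance claim can be checked directly from the data. A minimal sketch, using a small synthetic frame in place of `us_perm_visas.csv` (the values here are illustrative, not from the real dataset):

```python
import pandas as pd

# Synthetic stand-in for the real dataset (column names match the notebook).
demo = pd.DataFrame({
    'decision_date': ['2/1/2012', '5/3/2012', '5/9/2012', '11/20/2012'],
    'case_status':   ['Certified', 'Certified', 'Denied', 'Certified'],
})
demo['decision_date'] = pd.to_datetime(demo['decision_date'])

# Share of certified cases per decision month.
monthly = (demo.assign(month=demo['decision_date'].dt.month,
                       certified=demo['case_status'].eq('Certified'))
               .groupby('month')['certified'].mean())
print(monthly)
```

Run on the full dataset, this produces the per-month acceptance rates behind the figure.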

In [4]:
fields=['application_type', 'case_status', 'class_of_admission','country_of_citizenship','decision_date','employer_state','employer_name','job_info_work_city','pw_soc_title','us_economic_sector','case_received_date']
data=pd.read_csv('us_perm_visas.csv',usecols=fields)
df= data[['application_type', 'case_status', 'class_of_admission','country_of_citizenship','decision_date','employer_state','employer_name','job_info_work_city','pw_soc_title','us_economic_sector','case_received_date']].copy()
In [5]:
df.head()
Out[5]:
application_type case_status class_of_admission country_of_citizenship decision_date employer_state employer_name job_info_work_city pw_soc_title us_economic_sector case_received_date
0 PERM Certified J-1 ARMENIA 2/1/2012 NY NETSOFT USA INC. New York Computer Software Engineers, Applications IT NaN
1 PERM Denied B-2 POLAND 12/21/2011 NY PINNACLE ENVIRONEMNTAL CORP New York Hazardous Materials Removal Workers Other Economic Sector NaN
2 PERM Certified H-1B INDIA 12/1/2011 VA SCHNABEL ENGINEERING, INC. Lutherville Civil Engineers Aerospace NaN
3 PERM Certified B-2 SOUTH KOREA 12/1/2011 NY EBENEZER MISSION CHURCH Flushing File Clerks Other Economic Sector NaN
4 PERM Certified L-1 CANADA 1/26/2012 NY ALBANY INTERNATIONAL CORP. Albany Sales Engineers Advanced Mfg NaN
In [6]:
df.shape
Out[6]:
(374362, 11)
In [7]:
df.columns
Out[7]:
Index(['application_type', 'case_status', 'class_of_admission',
       'country_of_citizenship', 'decision_date', 'employer_state',
       'employer_name', 'job_info_work_city', 'pw_soc_title',
       'us_economic_sector', 'case_received_date'],
      dtype='object')
In [8]:
df.isnull().sum()
Out[8]:
application_type          239093
case_status                    0
class_of_admission         22845
country_of_citizenship        59
decision_date                  0
employer_state                42
employer_name                 12
job_info_work_city           102
pw_soc_title                2336
us_economic_sector        245294
case_received_date        135271
dtype: int64

Data Cleaning

In [9]:
msno.matrix(df)
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x17317074be0>
In [10]:
df.describe()
Out[10]:
application_type case_status class_of_admission country_of_citizenship decision_date employer_state employer_name job_info_work_city pw_soc_title us_economic_sector case_received_date
count 135269 374362 351517 374303 374362 374320 374350 374260 372026 129068 239091
unique 3 4 57 202 1450 113 71709 9429 905 17 2180
top ONLINE Certified H-1B INDIA 11/19/2015 CALIFORNIA COGNIZANT TECHNOLOGY SOLUTIONS US CORPORATION New York Software Developers, Applications IT 6/30/2014
freq 112564 181933 283018 205158 1981 51127 12081 16971 114841 52617 767
In [11]:
df=df.drop(['application_type'],axis=1)
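The drop above removes `application_type` by hand because roughly two thirds of its rows are missing. A hedged alternative is a missingness threshold that prunes any column past a cutoff; the 0.5 cutoff and the toy frame here are illustrative choices, not from the notebook:

```python
import pandas as pd
import numpy as np

# Toy frame: one column mostly missing, one complete.
toy = pd.DataFrame({
    'application_type': [np.nan, np.nan, np.nan, 'ONLINE'],
    'case_status':      ['Certified', 'Denied', 'Certified', 'Certified'],
})

threshold = 0.5  # drop columns where more than half the values are missing
keep = toy.columns[toy.isnull().mean() <= threshold]
trimmed = toy[keep]
print(list(trimmed.columns))  # application_type is dropped
```

On the real frame this would also flag `us_economic_sector` (245,294 of 374,362 missing), so the cutoff is worth tuning rather than fixing at 0.5.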
In [12]:
df = df.replace(np.NaN,'null')
In [13]:
df.dtypes
Out[13]:
case_status               object
class_of_admission        object
country_of_citizenship    object
decision_date             object
employer_state            object
employer_name             object
job_info_work_city        object
pw_soc_title              object
us_economic_sector        object
case_received_date        object
dtype: object
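Every column above is stored as `object`, including the two date columns. When parsing them, `errors='coerce'` turns unparseable entries (such as the `'null'` placeholder introduced earlier) into `NaT` instead of raising; a small sketch with illustrative values:

```python
import pandas as pd

dates = pd.Series(['2/1/2012', '12/21/2011', 'null'])
parsed = pd.to_datetime(dates, errors='coerce')
print(parsed.isna().sum())  # the 'null' placeholder becomes NaT
```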

All the WHICH questions

Which state has the most VISA applications?

In [14]:
df['employer_state'].value_counts()[:10].plot(kind='barh').invert_yaxis() 

Which job title has the most applications?

In [15]:
import plotly.graph_objects as go

top5 = df['pw_soc_title'].value_counts()[:5]
fig = go.Figure(data=[go.Pie(labels=top5.index, values=top5.values)])
fig.update_traces(hoverinfo='label+percent', textinfo='value', textfont_size=20,
                  marker=dict( line=dict(color='#000000', width=2)))
fig.show()

Which are the top 10 employers submitting applications?

In [16]:
df['employer_name'].value_counts()[:10].plot(kind='barh').invert_yaxis() 

Which are the top 10 home countries of applicants?

In [17]:
df['country_of_citizenship'].value_counts()[:10].plot(kind='bar')
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x173266dc9e8>

How many are approved?

In [18]:
sns.countplot(df['case_status'],label='Count')
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x17328c187b8>

Data Transformation

In [19]:
from sklearn import preprocessing 
label_encoder = preprocessing.LabelEncoder() 
columns = list(df)
df1 = df.copy()
for i in columns: 
    df1[i] = label_encoder.fit_transform(df[i])
df1.describe()
Out[19]:
case_status class_of_admission country_of_citizenship decision_date employer_state employer_name job_info_work_city pw_soc_title us_economic_sector case_received_date
count 374362.000000 374362.000000 374362.000000 374362.000000 374362.000000 374362.000000 374362.000000 374362.000000 374362.000000 374362.000000
mean 0.679732 22.467569 89.975601 750.219643 51.567913 34049.699588 4761.245153 520.473576 14.436839 1512.227042
std 0.803944 10.719500 41.691967 417.701984 34.639064 20372.251011 2583.788712 282.563502 4.652211 709.494545
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 19.000000 79.000000 391.000000 11.000000 16265.000000 2352.000000 196.000000 13.000000 930.000000
50% 1.000000 19.000000 79.000000 778.000000 50.000000 31201.000000 5169.000000 546.000000 17.000000 1726.000000
75% 1.000000 19.000000 85.000000 1107.000000 83.000000 51197.000000 6996.000000 803.000000 17.000000 2180.000000
max 3.000000 57.000000 202.000000 1449.000000 113.000000 71709.000000 9429.000000 905.000000 17.000000 2180.000000
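Refitting a single `LabelEncoder` in the loop above works for transforming, but each refit overwrites the previous column's mapping, so the integer codes can no longer be mapped back to categories. A sketch that keeps one fitted encoder per column (the small frame is illustrative):

```python
import pandas as pd
from sklearn import preprocessing

frame = pd.DataFrame({'case_status': ['Certified', 'Denied', 'Certified'],
                      'employer_state': ['NY', 'VA', 'NY']})

encoders = {}
encoded = frame.copy()
for col in frame.columns:
    enc = preprocessing.LabelEncoder()
    encoded[col] = enc.fit_transform(frame[col])
    encoders[col] = enc  # retained so codes can be decoded later

# Recover the original labels for a column from its integer codes.
restored = encoders['case_status'].inverse_transform(encoded['case_status'])
print(list(restored))
```

Keeping the encoders also lets the feature-importance plot later be read in terms of the original category names.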
In [20]:
df['decision_date']= pd.to_datetime(df['decision_date']) 
In [21]:
import matplotlib.ticker as mtick
df.assign(dummy = 1).groupby(
  ['dummy','case_status']
).size().groupby(level=0).apply(
    lambda x: 100 * x / x.sum()
).to_frame().unstack().plot(kind='bar',stacked=True,legend=False)
plt.xlabel('case_status')
plt.xticks([])
current_handles, _ = plt.gca().get_legend_handles_labels()
reversed_handles = reversed(current_handles)
correct_labels = reversed(df['case_status'].unique())
plt.legend(reversed_handles,correct_labels)
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter())
plt.show()
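The dummy-column groupby above computes class percentages; the same numbers come out more directly from `value_counts(normalize=True)`. A sketch on toy statuses:

```python
import pandas as pd

toy = pd.Series(['Certified', 'Denied', 'Certified', 'Withdrawn'],
                name='case_status')
# normalize=True returns fractions; scale to percentages.
pct = toy.value_counts(normalize=True) * 100
print(pct)
```

`pct.plot(kind='bar')` then gives an equivalent chart without the dummy-grouping trick.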
In [22]:
df.groupby(['country_of_citizenship','case_status']).size().nlargest(20).unstack().plot(kind='barh',stacked=True).invert_yaxis()
plt.show()
In [23]:
df['decision_date'] = pd.to_datetime(df['decision_date'],infer_datetime_format=True)
plt.clf()
df['decision_date'].map(lambda d: d.month).plot(kind='hist',stacked = True,edgecolor='Black',linewidth=2)
plt.xlabel('months')
plt.ylabel('Frequency')
plt.show()
In [24]:
df2 = df[df['case_status']=='Certified'].copy()
df2['decision_date'] = pd.to_datetime(df2['decision_date'],infer_datetime_format=True)
plt.clf()
df2['decision_date'].map(lambda d: d.month).plot(kind='hist',color='g',alpha=0.5,edgecolor='Black',linewidth=3)
plt.xlabel('months')
plt.ylabel('Frequency')
plt.show()
In [25]:
df['decision_date'] = pd.to_datetime(df['decision_date'],infer_datetime_format=True)
plt.clf()
df['decision_date'].map(lambda d: d.month).plot(kind='hist',stacked = True,label='All Status',edgecolor='black',linewidth=1)
df2 = df[df['case_status']=='Certified'].copy()
df2['decision_date'] = pd.to_datetime(df2['decision_date'],infer_datetime_format=True)
df2['decision_date'].map(lambda d: d.month).plot(kind='hist',color='#FF3333', alpha=0.9,label='Certified',edgecolor='black',linewidth=1)
plt.xlabel('Month')
plt.ylabel('Frequency')
plt.legend(loc='upper right')
plt.show()

Data Modeling

In [26]:
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from numpy import sort
seed=123
In [27]:
X=df1.drop(['case_status'],axis=1)
Y=df1['case_status']
In [28]:
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
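Since `case_status` is imbalanced (roughly half the rows are Certified), passing `stratify` to `train_test_split` keeps the class mix identical across train and test. A sketch on synthetic labels; the 80/20 split here is illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(7)
X_demo = rng.rand(100, 3)
y_demo = np.array([0] * 80 + [1] * 20)  # imbalanced labels

Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=7, stratify=y_demo)

# Both splits preserve the 80/20 class ratio exactly.
print(ytr.mean(), yte.mean())
```

Without stratification, a rare class like Withdrawn could end up underrepresented in the test split by chance.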
In [29]:
model = XGBClassifier()
model.fit(X_train, y_train)
Out[29]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)
In [30]:
y_pred = model.predict(X_test)
predictions = y_pred  # XGBClassifier.predict already returns class labels, so no rounding is needed
In [31]:
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
Accuracy: 83.33%
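With ~49% of cases Certified, accuracy alone can flatter the model, and `precision_score` / `recall_score` are imported at the top but never called. For the four `case_status` classes they need an averaging mode; a sketch on toy multiclass predictions (macro averaging is one common choice, not the notebook's):

```python
from sklearn.metrics import precision_score, recall_score

y_true = [0, 1, 2, 0, 1, 2]
y_hat  = [0, 1, 1, 0, 1, 2]

# Macro averaging weights every class equally, which matters
# when one class (Certified) dominates the data.
prec = precision_score(y_true, y_hat, average='macro')
rec  = recall_score(y_true, y_hat, average='macro')
print(round(prec, 3), round(rec, 3))
```

Substituting `y_test` and `predictions` from the cells above gives the per-class picture for the real model.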
In [32]:
print(model.feature_importances_)
[0.0451639  0.0318288  0.59917647 0.07098073 0.00655812 0.01339766
 0.09256618 0.02274622 0.11758186]
In [33]:
plt.bar(range(len(model.feature_importances_)), model.feature_importances_)
plt.show()
In [34]:
from xgboost import plot_importance
plot_importance(model,height=0.8, grid=False,edgecolor='black',linewidth=1,color='xkcd:salmon')
plt.show()
In [35]:
fig, (ax1,ax2,ax3)  = plt.subplots(3,figsize=(10,10))
plot_importance(model,height=0.8, grid=False,edgecolor='black',linewidth=1,color='xkcd:salmon',ax=ax1)
df['decision_date'] = pd.to_datetime(df['decision_date'],infer_datetime_format=True)
df['decision_date'].map(lambda d: d.month).plot(kind='hist',stacked = True,label='All Status',edgecolor='black',linewidth=1,ax=ax2,legend=True)
df2 = df[df['case_status']=='Certified'].copy()
df2['decision_date'] = pd.to_datetime(df2['decision_date'],infer_datetime_format=True)
df2['decision_date'].map(lambda d: d.month).plot(kind='hist',color='#FF3333',alpha=0.9,label='Accepted',edgecolor='black',linewidth=1,ax=ax2,legend=True)
ax2.set_xlabel('decision month')
ax2.set_ylabel('frequency')
df3 = df.copy()
df3 = df3.replace('null',np.NaN)
df3 = df3.dropna(subset=['case_received_date'])
df3['case_received_date'] = pd.to_datetime(df3['case_received_date'],infer_datetime_format=True)
df3['case_received_date'].map(lambda d: d.month).plot(kind='hist',stacked = True,label='All Status',edgecolor='black',linewidth=1,ax=ax3,legend=True)
df3=df3[df3['case_status']=='Certified']
df3['case_received_date'].map(lambda d: d.month).plot(kind='hist',color='#FF3333',alpha=0.9,label='Accepted',edgecolor='black',linewidth=1,ax=ax3,legend=True)
ax3.set_xlabel('case received month')
ax3.set_ylabel('frequency')
plt.show()
fig.savefig('results.png',bbox_inches='tight')
In [ ]:
thresholds = sort(model.feature_importances_)
In [ ]:
from sklearn.feature_selection import SelectFromModel
for thresh in thresholds:
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    predictions = [round(value) for value in y_pred]
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))
xgb.plot_tree(selection_model,num_trees=0)
plt.rcParams['figure.figsize'] = [50, 10]
plt.show()